Abstract

We propose in this paper a new family of kernels to handle time series, 
notably speech data, within the framework of kernel methods which in- 
cludes popular algorithms such as the Support Vector Machine. These ker- 
nels elaborate on the well known Dynamic Time Warping (DTW) family 
of distances by considering the same set of elementary operations, namely 
substitutions and repetitions of tokens, to map a sequence onto another. 
Associating to each of these operations a given score, DTW algorithms 
use dynamic programming techniques to compute an optimal sequence of 
operations with high overall score. In this paper we consider instead the 
score spanned by all possible alignments, take a smoothed version of their 
maximum and derive a kernel out of this formulation. We prove that this 
kernel is positive definite under favorable conditions and show how it can 
be tuned effectively for practical applications as we report encouraging 
results on a speech recognition task. 



1 Introduction 

Defining adequate kernels to handle properly structured objects, and notably 
time series, remains a key challenge for practitioners interested in the application 
of kernel methods to real-life data-sets. While practitioners willing to use kernel 
machines are tempted to apply standard vector kernels on time series, such as 
the popular Gaussian and polynomial kernels implemented in most toolboxes, 
they are faced with two issues: first, the time series considered in their databases 
might be of variable length, and second, standard kernels for vectors cannot cap- 
ture by construction the local dependencies between neighboring states of their 
time series. On the other hand, a family of similarities based on dynamic
programming, well known to the speech, bioinformatics and text-processing
communities, has been used to construct kernels, namely the Dynamic Time
Warping (DTW) score [1, 2], the Smith-Waterman algorithm and the edit
distance [3]. Since all these criteria address the two aforementioned issues,
practitioners have been tempted to use them directly with SVM implementations.
However, such distances cannot be translated easily into positive definite
kernels, which is an important requirement of kernel machines in the training
phase. Intuitively, such distances do not show favorable positive definiteness
properties because they rely on the computation of an optimum rather than on
the construction of a feature map, an issue that was studied in both [?] and
[?]. Building on these references, we propose in this work a new family of
kernels between time series mostly inspired by the approach of [3]. These
kernels are positive definite under favorable conditions but, most
importantly, they incorporate by construction more information on the compared
sequences than the kernels proposed in [1, 2], while requiring exactly the
same computational cost. In Section 2 we define such alignment kernels, prove
their positive definiteness and show that they can be computed efficiently. We
then present in Section 3 experimental results on an isolated-word recognition
task using a multiclass SVM setting, where alignment kernels need to be
rescaled due to a diagonal dominance issue but still show very encouraging
performance.

2 Alignment Kernels 

We write $\mathbb{N}$ for the set of positive natural integers, that is
$\{1, 2, \ldots\}$. Let $\mathbf{x} = (x_1, \ldots, x_n)$ and
$\mathbf{y} = (y_1, \ldots, y_m)$ be two finite series taking values in a
state space $\mathcal{X}$, that is two elements of
$\mathcal{X}^* \stackrel{\text{def}}{=} \bigcup_{i=1}^{\infty} \mathcal{X}^i$.
We define the alignment kernel in the following subsection and study its
theoretical and computational properties in Section 2.2 and Section 2.3.

2.1 A kernel inspired by the soft-max of all alignment scores

An alignment $\pi$ of length $|\pi| = p$ between two sequences $\mathbf{x}$ and
$\mathbf{y}$ is a pair of increasing $p$-tuples $(\pi_1, \pi_2)$ such that

$$1 = \pi_1(1) \leq \cdots \leq \pi_1(p) = n,$$

$$1 = \pi_2(1) \leq \cdots \leq \pi_2(p) = m,$$

with unitary increments and no simultaneous repetitions, that is
$\forall\, 1 \leq i \leq p - 1$,

$$\pi_1(i+1) \leq \pi_1(i) + 1, \quad \pi_2(i+1) \leq \pi_2(i) + 1,$$

$$(\pi_1(i+1) - \pi_1(i)) + (\pi_2(i+1) - \pi_2(i)) \geq 1.$$
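To make the definition concrete, the conditions above can be checked mechanically. The following is a minimal Python sketch, not part of the original method description; the function names `is_alignment` and `alignments` are hypothetical, and indices are 1-based to match the notation of the text.

```python
def is_alignment(pi1, pi2, n, m):
    """Return True if (pi1, pi2) satisfies the alignment conditions for lengths n and m."""
    p = len(pi1)
    if p == 0 or len(pi2) != p:
        return False
    # Boundary conditions: the path starts at (1, 1) and ends at (n, m).
    if (pi1[0], pi2[0]) != (1, 1) or (pi1[-1], pi2[-1]) != (n, m):
        return False
    for i in range(p - 1):
        d1 = pi1[i + 1] - pi1[i]
        d2 = pi2[i + 1] - pi2[i]
        # Unitary increments (each index moves by 0 or 1) and
        # no simultaneous repetition (at least one index moves).
        if d1 not in (0, 1) or d2 not in (0, 1) or d1 + d2 < 1:
            return False
    return True


def alignments(n, m):
    """Yield every alignment between sequences of lengths n and m as (pi1, pi2)."""
    def extend(path):
        i, j = path[-1]
        if (i, j) == (n, m):
            yield tuple(a for a, _ in path), tuple(b for _, b in path)
            return
        # The three admissible moves: repeat in y, repeat in x, or advance both.
        for di, dj in ((1, 0), (0, 1), (1, 1)):
            if i + di <= n and j + dj <= m:
                yield from extend(path + [(i + di, j + dj)])
    yield from extend([(1, 1)])
```

For two sequences of length 2 this enumeration yields three alignments: the diagonal path and the two paths that repeat one token. The number of alignments grows quickly with $n$ and $m$, so the enumeration is only meant for checking small cases.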

We write $\mathcal{A}(\mathbf{x}, \mathbf{y})$ for the set of all possible
alignments between $\mathbf{x}$ and $\mathbf{y}$. Intuitively, an alignment
$\pi$ between $\mathbf{x}$ and $\mathbf{y}$ describes a way to associate each
element of a sequence $\mathbf{x}$ to one or possibly more elements in
$\mathbf{y}$, and vice-versa. Such alignments can be conveniently represented
by paths in the $n \times m$ grid displayed in Figure 1. We would like to
outline now an important difference between the alignments defined here and
those considered in [?]. The previous definition for an alignment allows
tokens to repeat themselves to handle variable-length sequences while [3] and
[?] consider gaps instead, that is the insertion of a generic wildcard. Gaps
make sense in biological sequence analysis, where insertions and deletions of
patterns appear frequently in mutations of amino-acid sequences, as well as in
spike data where time series are mostly binary, while repeated states make
more sense in applications such as speech, where for instance a vowel might be
slightly elongated when uttered by a new speaker. This has both theoretical
and practical implications, since our algorithm and its theoretical
justification are slightly different from those exposed in [?]. Following the
well-known principle underlying DTW scores, the authors of [2] and [1]
consider the score:

$$S(\pi) = \sum_{i=1}^{|\pi|} \varphi\big(x_{\pi_1(i)}, y_{\pi_2(i)}\big),$$

where $\varphi$ is an arbitrary conditionally positive-definite kernel$^1$
defined on $\mathcal{X} \times \mathcal{X}$ (such as minus the squared
Euclidean distance in the case where $\mathcal{X}$ is Euclidean in [2] or
directly a Gaussian kernel in [1]). Dynamic programming algorithms provide an
efficient way to compute the optimal path $\pi^*$ in terms of mean score with
respect to $\varphi$,

$$\pi^* = \operatorname*{argmax}_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{y})} \frac{1}{|\pi|} S(\pi).$$

The authors of [2] use a truly c.p.d. kernel (that is, not p.d.), namely minus
the squared Euclidean distance $\varphi(x, y) = -\|x - y\|^2$, and then define
a "seemingly" p.d. kernel through exponentiation:

$$k_{\mathrm{DTW}_1}(\mathbf{x}, \mathbf{y}) = e^{\frac{1}{|\pi^*|} S(\pi^*)}
= \exp\left(-\min_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{y})} \frac{1}{|\pi|} \sum_{i=1}^{|\pi|} \big\|x_{\pi_1(i)} - y_{\pi_2(i)}\big\|^2\right),$$

while the authors of [1] use instead the Gaussian kernel for $\varphi$ and directly

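$^1$ A symmetric function $\varphi : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
is conditionally positive-definite if for any family
$x_1, \ldots, x_N \in \mathcal{X}$ and $c_1, \ldots, c_N \in \mathbb{R}$ such
that $\sum_i c_i = 0$, we have that $\sum_{i,j} c_i c_j \varphi(x_i, x_j) \geq 0$.
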
Figure 1: An alignment $\pi$ can be interpreted as a path in the $n \times m$
grid presented above, filled with the corresponding kernel values
$k_{ij} = k(x_i, y_j)$. The optimal path $\pi^*$ is such that
$\frac{1}{|\pi|} \sum_{i=1}^{|\pi|} k(x_{\pi_1(i)}, y_{\pi_2(i)})$ is maximal,
and corresponds in that case to $\pi_1 = (1,2,2,3,4,5,5,5)$ and
$\pi_2 = (1,2,3,4,4,5,6,7)$. Rather than considering only the contribution of
$\pi^*$, we propose to sum up over all possible alignment paths starting from
$(1,1)$ and leading to $(5,7)$, such as the ones represented by the two other
paths.



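consider$^2$ the corresponding score $S$ as a kernel:

$$k_{\mathrm{DTW}_2}(\mathbf{x}, \mathbf{y}) = \max_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{y})} \frac{1}{|\pi|} \sum_{i=1}^{|\pi|} e^{-\frac{1}{\sigma^2} \|x_{\pi_1(i)} - y_{\pi_2(i)}\|^2}.$$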

Note that both approaches stem from a c.p.d. kernel
$\varphi(x, y) = -\|x - y\|^2$ which is either exponentiated once $S$ is
maximized, as in [2], or directly exponentiated in the definition of $S$ to
yield an optimal sum of Gaussian kernels, as in [1]. In both cases, the
authors aim to take advantage of such an exponentiation to turn the kernel
seemingly positive definite, although this is ensured neither in theory nor
in practice. We refer to the proofs of [?] and [?] to give the reader an
intuition of why this is so.

The kernel we propose is not based on an optimal path chosen given a
criterion $S$ induced by $\varphi$, but takes advantage of all score values
$\{S(\pi), \pi \in \mathcal{A}(\mathbf{x}, \mathbf{y})\}$ spanned by all
possible alignments. We argue that the following kernel is positive definite
under mild conditions and may prove more robust to quantify
$^2$ To improve the readability of this presentation, we have dropped a few
more parameters that the authors of both [1] and [2] incorporate, but which do
not change the overall form of the criteria they consider. We take them into
account in the experimental section.

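the similarity of two sequences:

$$K(\mathbf{x}, \mathbf{y}) = \sum_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{y})} e^{S(\pi)}
= \sum_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{y})} e^{\sum_{i=1}^{|\pi|} \varphi(x_{\pi_1(i)}, y_{\pi_2(i)})}
= \sum_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{y})} \prod_{i=1}^{|\pi|} k\big(x_{\pi_1(i)}, y_{\pi_2(i)}\big), \qquad (1)$$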

where we have written $k = e^{\varphi}$. Positive definiteness aside, the main
motivation of Equation (1) is to consider the soft-max$^3$ of the scores of
all possible alignments, rather than the simple maximum of the considered
criterion. Note that the definition of the kernel $K$ does not include the log
used in the definition of the soft-max, but we will see in Section 3 that this
logarithm is ultimately required in practice. Intuitively, the sum of
Equation (1) quantifies the quality of both the optimal alignment and all the
alignments which are close to it, just as the kernels presented in [5] compare
two histograms through the polytope of all possible transportation plans which
may map one histogram to the other, rather than considering the optimal one
usually associated with the Monge-Kantorovich distance. In the sense of
kernel $K$, two sequences are similar not only if they have one single
alignment with high score, but rather if they share a wide set of efficient
alignments.
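As an illustration of Equation (1), the kernel can be evaluated by brute force on very short sequences by enumerating every alignment. The sketch below is only meant to make the soft-max interpretation tangible: its cost is exponential in the sequence lengths, the helper names (`alignment_paths`, `alignment_scores`, `brute_force_kernel`, `neg_sq_dist`) are hypothetical, scalar tokens are assumed for simplicity, and $\varphi$ is taken to be minus the squared distance.

```python
import math


def alignment_paths(n, m):
    """Yield all alignments as lists of 0-based index pairs from (0, 0) to (n-1, m-1)."""
    stack = [[(0, 0)]]
    while stack:
        path = stack.pop()
        i, j = path[-1]
        if (i, j) == (n - 1, m - 1):
            yield path
            continue
        for di, dj in ((1, 0), (0, 1), (1, 1)):
            if i + di < n and j + dj < m:
                stack.append(path + [(i + di, j + dj)])


def alignment_scores(x, y, phi):
    """Scores S(pi) of all alignments between x and y for a local score phi."""
    return [sum(phi(x[i], y[j]) for i, j in path)
            for path in alignment_paths(len(x), len(y))]


def brute_force_kernel(x, y, phi):
    """K(x, y) as the sum over all alignments of exp(S(pi))."""
    return sum(math.exp(s) for s in alignment_scores(x, y, phi))


def neg_sq_dist(a, b):
    """Assumed local score: minus the squared distance between scalar tokens."""
    return -(a - b) ** 2


x = [0.0, 1.0, 2.0]
y = [0.0, 2.0]
scores = alignment_scores(x, y, neg_sq_dist)
hard_max = max(scores)                                    # DTW-like optimum
soft_max = math.log(brute_force_kernel(x, y, neg_sq_dist))  # log K(x, y)
```

By construction `soft_max` is at least `hard_max` and exceeds it by at most the logarithm of the number of alignments, which is one way to read the statement that $K$ accounts for the optimal alignment together with the alignments close to it.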

2.2 Positive Definiteness of the Alignment kernel 

We provide in this section sufficient conditions on k to prove that K is a positive 
definite kernel. 

Theorem 1 Let $k$ be a p.d. kernel such that $\frac{k}{1+k}$ is positive
definite, then $K$ as defined in Equation (1) is positive definite.

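Proof. For any sequence $\mathbf{x} = (x_1, \ldots, x_n) \in \mathcal{X}^*$ and
any sequence $a \in \mathbb{N}^n$ we write $\mathbf{x}_a$ for

$$\mathbf{x}_a = (\underbrace{x_1, \ldots, x_1}_{a_1 \text{ times}}, \underbrace{x_2, \ldots, x_2}_{a_2 \text{ times}}, \ldots, \underbrace{x_n, \ldots, x_n}_{a_n \text{ times}}).$$

We define further for a sequence $\mathbf{x}$ of size $n$ the family
$\phi_{\mathbf{s}}(\mathbf{x})$ indexed by any element
$\mathbf{s} \in \mathcal{X}^*$ as the quantity

$$\phi_{\mathbf{s}}(\mathbf{x}) = \operatorname{card}\{a \in \mathbb{N}^n : \mathbf{x}_a = \mathbf{s}\}.$$

Note for instance that if $\mathcal{X} = \{0, 1\}$, $\phi_{0001}(01) = 1$ while
$\phi_{0001}(001) = 2$. Let now $\kappa$ be the following kernel on
$\mathcal{X}^*$, itself parameterized by an arbitrary p.d. kernel $\chi$ on
$\mathcal{X}$ such that $|\chi| < 1$:

$$\kappa(\mathbf{x}, \mathbf{y}) = \begin{cases} \prod_{i=1}^{|\mathbf{x}|} \chi(x_i, y_i) & \text{if } |\mathbf{x}| = |\mathbf{y}|, \\ 0 & \text{if } |\mathbf{x}| \neq |\mathbf{y}|. \end{cases}$$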
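The two objects introduced so far in the proof can be checked on small examples. The following sketch, with hypothetical function names `phi_count` and `kappa`, counts the repetition vectors $a$ such that $\mathbf{x}_a = \mathbf{s}$ and evaluates the length-matched product kernel for a user-supplied $\chi$; it is an illustration under these assumptions rather than part of the proof.

```python
def phi_count(s, x):
    """Number of vectors a of positive integers such that x expanded by a equals s."""
    def count(i, j):
        # Ways to expand x[i:], repeating each token at least once, into s[j:].
        if i == len(x):
            return 1 if j == len(s) else 0
        total = 0
        r = j
        # Let token x[i] cover s[j:r] for every admissible r > j.
        while r < len(s) and s[r] == x[i]:
            r += 1
            total += count(i + 1, r)
        return total
    return count(0, 0)


def kappa(x, y, chi):
    """Product of chi over aligned positions, or 0 if the lengths differ."""
    if len(x) != len(y):
        return 0.0
    value = 1.0
    for xi, yi in zip(x, y):
        value *= chi(xi, yi)
    return value
```

With binary sequences written as strings, `phi_count("0001", "01")` and `phi_count("0001", "001")` reproduce the two counts quoted in the text.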



$^3$ Given a family of positive scalars $z = (z_1, z_2, \ldots, z_n)$, we define the soft-max of $z$ as $\log \sum_i e^{z_i}$.






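$\kappa$ is trivially a p.d. kernel on the whole of $\mathcal{X}^*$: simply
note that for any sample $\mathbf{x}_1, \ldots, \mathbf{x}_N$ of points in
$\mathcal{X}^*$, the matrix $[\kappa(\mathbf{x}_i, \mathbf{x}_j)]$ can be
rearranged in a block diagonal form by sorting
$\mathbf{x}_1, \ldots, \mathbf{x}_N$ in increasing length, with blocks which
are all positive definite. The kernel $\tilde{K}$, defined on
$\mathbf{x}, \mathbf{y} \in \mathcal{X}^*$ as

$$\tilde{K}(\mathbf{x}, \mathbf{y}) = \sum_{\mathbf{s} \in \mathcal{X}^*} \sum_{\mathbf{s}' \in \mathcal{X}^*} \phi_{\mathbf{s}}(\mathbf{x})\, \phi_{\mathbf{s}'}(\mathbf{y})\, \kappa(\mathbf{s}, \mathbf{s}'),$$

is positive definite by construction, and can be rewritten as

$$\tilde{K}(\mathbf{x}, \mathbf{y}) = \sum_{a \in \mathbb{N}^n} \sum_{b \in \mathbb{N}^m} \kappa(\mathbf{x}_a, \mathbf{y}_b),$$

where $n = |\mathbf{x}|$ and $m = |\mathbf{y}|$. We write $\varepsilon$ for
the sequence $1, 2, 3, \ldots$ and, given $a \in \mathbb{N}^p$,
$\varepsilon_a$ for

$$\varepsilon_a = (\underbrace{1, \ldots, 1}_{a_1 \text{ times}}, \underbrace{2, \ldots, 2}_{a_2 \text{ times}}, \ldots, \underbrace{p, \ldots, p}_{a_p \text{ times}}).$$

For two sequences $u$ and $v$ of same length we write $u \otimes v$ for
$((u_1, v_1), \ldots, (u_p, v_p))$. Since $\kappa$ vanishes on sequences of
different lengths, only the couples $(a, b)$ with $\|a\| = \|b\|$ contribute
to the sum, where $\|a\| = a_1 + \cdots + a_n$ is the length of
$\mathbf{x}_a$. Such a couple
$(a, b) \in \mathbb{N}^n \times \mathbb{N}^m$ defines a sequence of double
indexes $\varepsilon_a \otimes \varepsilon_b$, which we use to express
$\tilde{K}$ as

$$\tilde{K}(\mathbf{x}, \mathbf{y}) = \sum_{\substack{a \in \mathbb{N}^n,\, b \in \mathbb{N}^m \\ \|a\| = \|b\|}} \prod_{i=1}^{\|a\|} \chi\big((\mathbf{x}, \mathbf{y})_{\varepsilon_a \otimes \varepsilon_b (i)}\big).$$

Note now that for each couple $(a, b)$ there exists a unique alignment $\pi$
and an integral vector $v$ of adequate size such that
$\pi_v = \varepsilon_a \otimes \varepsilon_b$ ($\pi$ is namely the sequence
$\varepsilon_a \otimes \varepsilon_b$ stripped of all repeats, recorded in
$v$), and conversely that for every couple $(\pi, v)$ there exists a unique
pair $(a, b)$ such that $\pi_v = \varepsilon_a \otimes \varepsilon_b$. Hence,
writing $\chi_{\pi(i)}$ as a shortcut for
$\chi(x_{\pi_1(i)}, y_{\pi_2(i)})$, we have that

$$\begin{aligned}
\tilde{K}(\mathbf{x}, \mathbf{y})
&= \sum_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{y})} \sum_{v \in \mathbb{N}^{|\pi|}} \prod_{j=1}^{\|v\|} \chi\big((\mathbf{x}, \mathbf{y})_{\pi_v(j)}\big)
= \sum_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{y})} \sum_{v \in \mathbb{N}^{|\pi|}} \prod_{i=1}^{|\pi|} \chi_{\pi(i)}^{v_i} \\
&= \sum_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{y})} \prod_{i=1}^{|\pi|} \left(\chi_{\pi(i)} + \chi_{\pi(i)}^2 + \chi_{\pi(i)}^3 + \cdots\right) \\
&= \sum_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{y})} \prod_{i=1}^{|\pi|} \frac{\chi_{\pi(i)}}{1 - \chi_{\pi(i)}}.
\end{aligned}$$

Setting now $\chi = \frac{k}{1+k}$, we recover the expression of Equation (1). ■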

Remark. Kernels $k$ such that $\frac{k}{1+k}$ is positive definite can be
trivially computed by considering first a kernel $\chi$ such that
$|\chi| < 1$ and defining
$k = \sum_{i=1}^{\infty} \chi^i = \chi / (1 - \chi)$. If $\mathcal{X}$ is
Euclidean and $\chi$ is for instance the halved Gaussian kernel
$\frac{1}{2} e^{-\frac{1}{\sigma^2} \|x - y\|^2}$, then the kernel

$$k(x, y) = \frac{\frac{1}{2} e^{-\frac{1}{\sigma^2} \|x - y\|^2}}{1 - \frac{1}{2} e^{-\frac{1}{\sigma^2} \|x - y\|^2}}$$

can be directly used, and is itself numerically very similar to the Gaussian
kernel. In practice, most kernels that we considered, including the Gaussian
kernel and the exponential of the Gaussian kernel, have the property that
$\frac{k}{1+k}$ yields positive semidefinite matrices, which in an
experimental context will be sufficient.
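The construction of the remark can be written down in a few lines. This is a minimal sketch assuming vectors are given as sequences of floats and that the Gaussian kernel is parameterized as $e^{-\|x-y\|^2/\sigma^2}$, which is an assumption about the bandwidth convention; the function names `halved_gaussian` and `local_kernel` are hypothetical.

```python
import math


def halved_gaussian(x, y, sigma):
    """chi(x, y) = 0.5 * exp(-||x - y||^2 / sigma^2), so that 0 < chi <= 0.5 < 1."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return 0.5 * math.exp(-sq_dist / sigma ** 2)


def local_kernel(x, y, sigma):
    """k = chi / (1 - chi), which by construction satisfies k / (1 + k) = chi."""
    chi = halved_gaussian(x, y, sigma)
    return chi / (1.0 - chi)
```

Because $\chi \leq 1/2$, the resulting $k$ takes values in $(0, 1]$ and equals 1 when $x = y$, like the Gaussian kernel itself.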

2.3 Computation and Factorization 

We show in this section that the computation of the alignment kernel K can be 
performed in quadratic complexity, namely in |x||y| iterations, similarly to the 
naive implementation of DTW scores. 

Theorem 2 Given $\mathbf{x} = (x_1, \ldots, x_n)$ and
$\mathbf{y} = (y_1, \ldots, y_m)$ two sequences of $\mathcal{X}^*$, we set the
double-subscripted series $M_{i,j}$ as $M_{i,0} = 0$ for $i = 1, \ldots, n$,
$M_{0,j} = 0$ for $j = 1, \ldots, m$, and $M_{0,0} = 1$. Computing recursively
for $(i, j) \in \{1, \ldots, n\} \times \{1, \ldots, m\}$ the terms

$$M_{i,j} = (M_{i,j-1} + M_{i-1,j-1} + M_{i-1,j})\, k(x_i, y_j),$$

we obtain that $K(\mathbf{x}, \mathbf{y}) = M_{n,m}$.

Proof. The result can be proved by recursion and is intuitively an equivalent 
of the DTW algorithm where the max-sum algebra is simply replaced by the 
sum-product one. ■ 
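The recursion of Theorem 2 translates directly into a quadratic-time routine. The sketch below is one possible plain-Python rendering; the function name `alignment_kernel` is hypothetical and the local kernel `k` is passed in as a callable.

```python
def alignment_kernel(x, y, k):
    """Compute K(x, y) with the recursion M[i][j] = (left + diagonal + up) * k(x_i, y_j)."""
    n, m = len(x), len(y)
    # Border values: M[0][0] = 1 and all other entries of row 0 and column 0 are 0.
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 1.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = (M[i][j - 1] + M[i - 1][j - 1] + M[i - 1][j]) * k(x[i - 1], y[j - 1])
    return M[n][m]
```

A quick sanity check is to take $k \equiv 1$, in which case $K(\mathbf{x}, \mathbf{y})$ simply counts the alignments: 3 for two sequences of length 2 and 13 for two sequences of length 3. For a constant $k \equiv c$ and two sequences of length 2, the three alignments contribute $c^2 + 2c^3$.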

3 Experiments 

The proposed kernel was tested on the English E-set of the TI46 database, 
which consists of 3724 spoken letters from the set {B,C,D,E,G,P,T,V,Z}. The 
set has a predefined division into a training set and a test-set with 1433 and 
2291 utterances, respectively. From each signal we extracted a sequence of
13-dimensional feature vectors with Mel-frequency cepstral coefficients (MFCC),
hence $\mathcal{X}$ is simply $\mathbb{R}^{13}$ here. The feature vectors were computed every 10 ms 
using a 25 ms wide Hamming window. 

We compare in this section three different methods to predict a letter from a 
signal: first, a conventional HMM approach where we estimate the parameters 
of an HMM model for each letter based on the training set, and use these 
distributions to associate to a sequence in the test-set the letter for which it 
has maximum likelihood. We use a left-to-right HMM model with 6 states and 
5 mixtures in each state with diagonal covariance matrices. The distributions 
were actually estimated using the delta and acceleration coefficients (that is on 
elements of $\mathbb{R}^{39}$), which are known to improve their performance. 
Our second and third approaches are based on a standard one-vs-all multiclass
SVM, using the Spider toolbox$^4$. We use the kernel proposed in [1] and the
alignment kernel with $\varphi$ defined as the Gaussian kernel. In both cases
the parameter $\sigma \in \{10, 15, 20, 25, 30, 35, 40\}$ of the Gaussian
kernel along with the regularization constant
$C \in \{10^i, i = -2, \ldots, 6\}$ of the SVMs are first selected to obtain
the best cross-validation (CV) error on the training set, estimated on 4 folds
with 4 repeats. Facing exactly the same problem encountered in [3], we have to
address the fact that the values of the alignment kernel are exceedingly
diagonally dominant, that is the value of the kernel
$k(\mathbf{x}, \mathbf{x})$ for a point against itself is many orders of
magnitude larger than $k(\mathbf{x}, \mathbf{y})$. Hence, and although this
operation is known not to conserve positive definiteness, we directly use the
logarithm of the alignment kernel, $\log K$, to rescale the obtained values.
In such a case, we do exactly consider the soft-max of the set of all
$S(\pi)$ values, $\pi$ spanning $\mathcal{A}(\mathbf{x}, \mathbf{y})$. The
empirical Gram matrix obtained on the training set was regularized by adding
to it minus its smallest eigenvalue times the identity matrix to turn all its
eigenvalues positive, while the train versus test sets kernel matrix was left
unchanged. The same procedure was also conducted for the DTW kernel proposed
in [1], which also produces negative eigenvalues.
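Since only $\log K$ is used in the experiments and the kernel values span many orders of magnitude, one natural way to obtain it is to run the recursion of Theorem 2 directly in the log domain. The text does not say how $\log K$ was computed, so the following is a sketch of one numerically stable option under that assumption; the names `logsumexp3` and `log_alignment_kernel` are hypothetical, and `log_k` is assumed to return $\log k(x_i, y_j)$, that is $\varphi(x_i, y_j)$.

```python
import math


def logsumexp3(a, b, c):
    """Stable log(exp(a) + exp(b) + exp(c)); returns -inf if all inputs are -inf."""
    top = max(a, b, c)
    if top == -math.inf:
        return -math.inf
    return top + math.log(math.exp(a - top) + math.exp(b - top) + math.exp(c - top))


def log_alignment_kernel(x, y, log_k):
    """Compute log K(x, y) with the recursion of Theorem 2 carried out on logarithms."""
    n, m = len(x), len(y)
    # L[i][j] stores log M[i][j]; log 0 is represented by -inf and log 1 by 0.
    L = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    L[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            L[i][j] = (logsumexp3(L[i][j - 1], L[i - 1][j - 1], L[i - 1][j])
                       + log_k(x[i - 1], y[j - 1]))
    return L[n][m]
```

Working on logarithms avoids the underflow that the product form would hit for long sequences, where $K(\mathbf{x}, \mathbf{y})$ itself may not be representable as a floating-point number even though $\log K$ is.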

We obtained a test error of 11.7% for the HMM approach, 11.5% for the kernel
of [1] ($\sigma = 15$, with 10.3% CV error on the training set) and 5.4% for
the alignment kernel ($\sigma = 25$, with 4.3% CV error on the training set),
giving the log-alignment kernel a fair edge. The regularization parameter $C$
did not have a strong influence on our results when set in the middle range
and was fixed at $C = 1000$. To compare the two kernels further, we performed
a 4-fold cross-validation with 4 repeats on the merged train and test sets,
and plot in Figure 2 the CV error of each kernel as a function of $\sigma$ to
illustrate the influence of the parameter on overall results. We reimplemented
the CV feature of Spider to make sure for both kernels that the regularization
was only carried out on the training-fold Gram matrix. Figure 2 clearly
advocates the soft-max perspective provided by the log-alignment kernel.
Figure 2: 4-fold cross-validation errors with 4 repeats on the whole dataset of
3724 utterances as a function of the Gaussian kernel width $\sigma$ for the two
studied kernels. All CV standard deviations are below a tenth of the presented
values.